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Endogenous retrotransposons have caused extensive genomic variation within mammalian species, but the functional 
implications of such mobilization are mostly unknown. We mapped thousands of endogenous retrovirus (ERV) germline 
integrants in highly divergent, previously unsequenced mouse lineages, facilitating a comparison of gene expression in the 
presence or absence of local insertions. Polymorphic ERVs occur relatively infrequently in gene introns and are par- 
ticularly depleted from genes involved in embryogenesis or that are highly expressed in embryonic stem cells. Their 
genomic distribution implies ongoing negative selection due to deleterious effects on gene expression and function. A 
polymorphic, intronic ERV at SIcl5a2 triggers up to 49-fold increases in premature transcriptional termination and up to 
39-fold reductions in full-length transcripts in adult mouse tissues, thereby disrupting protein expression and functional 
activity. Prematurely truncated transcripts also occur at Polrla, Sponl, and up to -5% of other genes when intronic ERV 
polymorphisms are present. Analysis of expression quantitative trait loci (eQTLs) in recombinant BxD mouse strains 
demonstrated very strong genetic associations between the polymorphic ERV in cis and disrupted transcript levels. Pre- 
mature polyadenylation is triggered at genomic distances up to >12.5 kb upstream of the ERV, both in cis and between 
alleles. The parent of origin of the ERV is associated with variable expression of nonterminated transcripts and differential 
DNA methylation at its 5'-Iong terminal repeat. This study defines an unexpectedly strong functional impact of ERVs in 
disrupting gene transcription at a distance and demonstrates that ongoing retrotransposition can contribute significantly 
to natural phenotypic diversity. 



[Supplemental material is available for this article.] 

The laboratory mouse is the premier model organism, facilitating 
comparative studies of human diseases, development, and natural 
variation. Numerous distinct mouse lineages manifest phenotypic 
differences such as various coat colors and differential suscepti- 
bilities to diseases (Wade and Daly 2005). The molecular basis for 
natural phenotypic variation or allele-specific expression differ- 
ences remains unclear in most cases, although recent studies have 
associated differential gene expression with various forms of struc- 
tural variation in mouse and human genomes (Yan et al. 2002; 
Adams et al. 2005; Yang et al. 2007; She et al. 2008; Cahan et al. 
2009; Schlattl et al. 2011; Yalcin et al. 2011). 

Transposons are a potentially major but relatively unex- 
amined determinant of such allele-specific, transcriptional varia- 
tion. They are strong candidates for regulating or disrupting gene 
expression since they comprise almost half of the mouse genome 
and certain elements are still actively mobilized. Approximately 
10% of naturally occurring mutations in the mouse have been 
attributed to insertional mutagenesis of coding sequences due to 
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endogenous retrotransposition (Waterston et al. 2002). We re- 
cently showed that thousands of polymorphic retrotransposon 
integrants of various active classes are present in the C57BL/6J 
(hereafter abbreviated as B6) reference genome but absent from 
one or more other classical inbred mouse lineages (i.e., A/J; DBA/2J, 
DBA; 129Sl/SvImJ, 129S1; and 129Xl/SvJ, 129X1) (Akagi et al. 
2008). Of these, new endogenous retrovirus (ERV)-K integrants 
may be particularly capable of altering transcriptomes in diverse 
tissues (van de Lagemaat et al. 2003; Medstrand et al. 2005). Mem- 
bers of various ERV families make up —10% of the mouse genome. 
While most such genomic elements are ancient and are comprised 
of solo long terminal repeats (LTRs), the ERV-K family includes 
a significant number of young full-length elements flanked by 
virtually identical LTRs. Of these, intracisternal A-particle (IAP) 
retrotransposons are still capable of autonomous mobilization and 
are transcriptionally activated by genome-wide cytosine demeth- 
ylation (Walsh et al. 1998), contributing to tumor formation 
(Howard et al. 2007). Approximately 1000 of these elements contain 
intact open reading frames (ORFs). They have long been known to 
be active and polymorphic in various mouse lineages and tumors 
(Shen-Ong and Cole 1982; Lueders et al. 1993; Zhang et al. 2008). 
To simplify nomenclature, we refer to these IAP retrotransposons 
as ERVs. 
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Well-characterized retrotransposon integrants that alter gene 
expression and mediate phenotypic variability include the dilute 
and hairless coat color mutations (Copeland et al. 1983; Stoye et al. 
1988). In these cases, intronic murine leukemia virus (MLV) in- 
sertions cause aberrant splicing of overlapping gene transcripts. 
MLV sequences are incorporated directly at the 3 ' ends of disrupted 
transcripts, which are then prematurely terminated (Seperack et al. 
1995; Cachon-Gonzalez et al. 1999). In contrast, ERV (IAP) integrants 
upstream of or within the A (i.e., agouti) andAxinl (i.e., axin 1) genes 
inserted active, heterologous promoters in the resulting agouti 
viable yellow (A vy ) and axin fused (Axinl Fu ) alleles, resulting in 
epigenetically regulated, variable initiation of downstream fusion 
transcripts (Morgan et al. 1999; Whitelaw and Martin 2001). Full- 
length ERV integrants also can affect neighboring gene tran- 
scription by direct incorporation of polyadenylation signal se- 
quences and/or binding sites for transcription factors (van de 
Lagemaat et al. 2003; Medstrand et al. 2005), as was observed 
recently for Adamtsl3 (i.e., a disintegrin-like and metal- 
lopeptidase [reprolysin type] with thrombospondin type 1 motif, 
13), a von Willebrand factor-cleaving protease disrupted by an 
intronic ERV integrant (Banno et al. 2004; Zhou et al. 2007). 
However, until now, other effects by ERV polymorphisms have 
not been characterized. To associate variable gene expression 
levels with the presence or absence of local ERV insertions, we 
mapped ERVs in several highly divergent mouse lineages. We 
then identified and characterized numerous cases in which gene 
transcripts are strongly and differentially 
disrupted at a distance by nearby ERV 
polymorphisms. 



Results 

Identification of polymorphic ERVs 
in diverse mouse strains by transposon 
junction assay 

To study possible effects of ERV integrants 
on neighboring gene expression levels, 
first we mapped such integrants in previ- 
ously unsequenced, diverse mouse strains, 
without prior knowledge of their location 
or polymorphism status. Recently, vari- 
ous methods to find ERV insertions, both 
polymorphic and nonpolymorphic, have 
been described (Horie et al. 2007; Akagi 
et al. 2008; Takabatake et al. 2008; Zhang 
et al. 2008; Qin et al. 2010; Ray et al. 
2011). We developed and optimized a 
sensitive, high-resolution genomic map- 
ping assay using PCR and 454 Life Sci- 
ences (Roche) sequencing, which we call 
the transposon junction assay (Supple- 
mental Fig. 1; Pornthanakasem and 
Mutirangura 2004; Iskow et al. 2010; 
Witherspoon et al. 2010). Using trans- 
poson sequence-specific and degenerate 
primers for genomic PCR amplification, 
we targeted members of certain young 
ERV families including IAPLTR1, IAPLTR2, 
and IAPEY2 elements (Kapitonov and 
Jurka 2008) since they are anticipated to 
be polymorphic (Qin et al. 2010) in diverse 



mouse lineages. Based on their features observed in the reference 
B6 genome, the IAPLTR1 integrants are most likely to be full length 
and to have identical LTRs (data not shown), consistent with their 
status as the youngest ERV insertions (Qin et al. 2010). 

We optimized the transposon junction assay using various 
combinations of restriction enzymes and degenerate primers 
(Supplemental Table 1). This method reidentified 1538 out of 1665 
(92.4%) of the youngest mappable ERV (IAPLTR1) integrants in the 
reference B6 genome at ~ 14-fold sequencing coverage. To validate 
these results, we compared chromosomal distributions of identi- 
fied ERVs with previously annotated reference elements (Supple- 
mental Fig. 1). Overall, the correlation between local densities of 
reference versus resequenced transposon integrants is excellent, 
particularly for IAPLTR1 elements (p < 2.2 X 10" 16 ). Pearson's 
correlation coefficients are 0.75 (IAPLTR1), 0.78 (IAPLTR2), and 
0.60 (IAPEY2). Thus no significant global bias was detected in 
identifying ERVK elements by targeted resequencing. 

We used this assay to define the genomic locations of young 
ERVs in six diverse mouse lineages, i.e., A/J, B6, CAST/EiJ (CAST), 
MOLF/EiJ (MOLF), SPRET/EiJ (SPRET), and WSB/EiJ (WSB); 25,069 
integrants were identified (Supplemental Table 1). The chromo- 
somal integration sites of IAPLTR1 integrants are almost entirely 
different in comparing the highly divergent, wild mouse lineages 
(SPRET, CAST, and MOLF), because only four out of several thou- 
sand IAPLTR1 integrants are present at orthologous loci (Fig. 1). 
This result strongly suggests that the rare, shared integrants are 
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Figure 1 . Genomic variation due to ERVs in diverse mouse strains. (A,B) Venn diagrams indicating counts 
(n) of shared versus distinct ERV elements at individual integration sites (orthologous locations) in previously 
unsequenced mouse strains. The youngest IAPLTR1 (A) and older IAPEY2 (B) elements were compared at 
genomic insertion sites in related B6, A/J, and WSB (top) and in divergent B6, CAST, and SPRET (bottom) 
mouse strains. MOLF integrants are not presented here. Only four of several thousand youngest IAP integrants 
occur at orthologous loci in the most divergent strains (lower left). (C) Genome-wide distributions of ERV 
polymorphisms in diverse mouse lineages. Histograms display the numbers of strains containing polymorphic 
ERVs at orthologous loci within 10-MB genomic bins for (top) IAPLTR1; (middle) IAPLTR2; and (bottom) 
IAPEY2 elements. (Dark blue) Integrants present in one strain; (light blue) two strains; (purple) three; (light 
pink) four; (orange) five; (red) all six. (Bottom) Mouse chromosomes 1-19, X, and Y (alternating shading). 
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identical by descent (Salem et al. 2005; Ray et al. 2011) and reveals 
extensive, lineage-specific retrotransposition. As expected, they 
are conserved at higher frequencies at orthologous loci in the re- 
lated classical lines (B6, A/J, and WSB). Additionally, our bio- 
informatics analysis identified thousands of additional, previously 
unreported ERV polymorphisms that are present in one or more of 
the nonreference "Celera strains/' i.e., A/J, DBA, 129S1, and 129X1 
mice (Akagi et al. 2008) but absent from the B6 reference genome. 
Very similar proportions of polymorphic retrotransposon families 
including ERVs are present or absent in these strains, regardless of 
the genome chosen for comparison (Supplemental Fig. 2). 

To compare the chromosomal distributions of polymorphic 
versus nonpolymorphic ERVs in various lineages, we plotted the 
number of elements counted in 10-Mb genomic intervals in single 
versus additional strains (Fig. 1C). IAPLTR1 integrants are mostly 
present in only one of the six diverse strains studied here; they are 
highly polymorphic. When their counts 
are summed up across the different 
strains, the polymorphic integrants are 
quite uniformly distributed across the 
genome, without large hotspot or de- 
sert regions in the chromosomes. In con- 
trast, older IAPLTR2 and particularly 
IAPEY2 integrants tend to be more non- 
polymorphic in multiple diverse strains. 
Both reference and polymorphic ERV 
integrants are more uniformly distributed 
across the genome than are reference and 
polymorphic LI retrotransposons (Sup- 
plemental Fig. 2; Akagi et al. 2008). 

We validated a collection of ERV 
integrants identified here, by amplifying 
occupied or empty genomic target sites in 
up to 21 diverse mouse lineages using 
PCR (Supplemental Fig. 1; Supplemental 
Tables 2, 3). The results demonstrate that 
both the transposon junction assay and 
our analysis of Celera sequence traces 
(Akagi et al. 2008) accurately determine 
the presence of integrants, and confirm 
that ERV insertions are extremely poly- 
morphic (Zhang et al. 2008). Recent 
whole-genome shotgun (WGS) sequenc- 
ing of 17 mouse strains has facilitated 
identification of thousands more retro- 
transposon polymorphisms (Keane et al. 
2011; Yalcin et al. 2011). We compared 
ERVs mapped by the transposon junction 
assay, PCR validation, and WGS pre- 
dictions. Out of 140 verifiable integrants 
called by the transposon junction assay, 
135 (>96%) were validated both by PCR 
and by WGS sequencing (Supplemental 
Table 2). In four of the five discrepant 
cases, our transposon junction assay and 
confirming PCR indicated empty target 
sites, but WGS demonstrated the pres- 
ence of ERV integrants. These cases in- 
dicate that the genomic DNA samples 
used in our assays versus WGS sequenc- 
ing are likely to include bona fide se- 
quence differences of unknown cause. 



Genomic distribution of polymorphic ERVs suggests 
deleterious impact on gene expression 

We assessed the genomic locations of young ERV polymorphisms 
relative to annotated genes, since their existing distribution would 
reflect insertion preferences and/or losses of deleterious elements. 
All classes of young ERV polymorphisms occur in gene introns at 
lower densities than expected from a simulated pattern of in- 
sertions due solely to chance (Fig. 2; Table 1). They are even more 
strongly depleted from particular genes involved in embryogenesis 
and/or highly expressed in embryonic stem (ES) cells (Fig. 2; 
Mikkelsen et al. 2007). The oldest ERVs mapped here, IAPEY2 el- 
ements, occur at even lower densities in intragenic locations than 
younger IAPs such as LTR1 and LTR2. Of the ERVs within genes, 
~72%-83% are oriented antiparallel to the sense (coding) strand of 
the genes, rather than the 50/50 orientation frequency expected if 
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Young ERVs are excluded from introns, particularly from embryogenesis and highly 
expressed genes. (A) "Observed" ERV integrant counts are plotted as percentages of "expected" counts 
within all gene introns (black histograms) or embryogenesis genes (gray). Genomic locations of 
various classes of ERVs (x-axis) were identified in diverse mouse lineages. Expected counts were de- 
termined by random simulation of 2 million insertion sites across the reference genome. By chance, 
— 35% of ERV insertions would be expected to fall within RefSeq gene introns, and —2.7% of all 
insertions would fall within embryogenesis genes, defined by the Mouse Genome Informatics database 
(http://www.informatics.jax.org). This normalization corrects for gene lengths. Percentages <100% 
signify relative exclusion of certain ERV subtypes from particular gene categories. (B) Based on their 
expression levels in mouse ES cells measured by microarrays (Mikkelsen et al. 2007), genes were binned 
into eight groups ranked from 1 (lowest expression) to 8 (highest), each with roughly equivalent 
numbers of genes expressed at comparable levels. Ratios of the observed numbers of genes containing 
intronic ERV integrants versus the expected number of genes identified by random simulation are 
presented (Brady et al. 2009) for different classes of ERV integrants (key, upper right). (Dashed line) Ratio = 1 
signifies equivalence between observed and expected counts; ratios < 1 signify relative exclusion of ERV 
integrants from particular groups of genes. 
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Table 1. Young ERVs occur at low densities in gene introns 



Number of 
elements 



Number of 
intronic 
elements 



» intronic 



Number of 
sense oriented, 
intronic 



% sense 
oriented, 
intronic 



Reference ERV class 
IAPLTR1 
IAPLTR2 
IAPEY2 
All ref IAP 
Simulation 



4410 
5772 
1112 
11,661 
2,000,000 



Nonreference ERV class 
IAPLTR1 2789 
IAPLTR2 3494 
IAPEY2 501 
All non-ref IAP 7002 
Simulation 2,000,000 



961 
901 
123 
2026 
700,262 



622 
585 
58 
1291 
700,262 



21 .8% 
15.6% 
11.1% 
1 7.4% 
35.0% 



22.3% 
16.7% 
1 1 .6% 
1 8.4% 
35.0% 



256 
235 
21 
518 
349,896 



164 
161 
15 
342 
349,896 



26.6% 
26.1% 
17.1% 
25.6% 
50.0% 



26.4% 
27.5% 
25.9% 
26.5% 
50.0% 



Summary of ERV elements identified in (top) B6 reference genome and (bottom) from combined 
analysis by transposon junction assay and bioinformatics analysis of A/J, DBA, 1 29S1 , 1 29X1 , WSB, 
SPRET, CAST, and MOLF mouse lineages. (Simulation) As a control, we simulated ERV insertions 
randomly throughout the reference genome. The results show strong biases against intronic in- 
sertions, particularly of the older IAPEY2 elements, and against elements positioned in the same 
orientation as gene open reading frames (i.e., sense orientation). 



integration and retention of ERVs were due to chance (Table 1). 
This orientation bias has been observed previously (Smit 1999; 
Medstrand et al. 2002; van de Lagemaat et al. 2006; Zhang et al. 
2008). It plausibly could reflect patterns of de novo integration, 
but new ERV integrants in mouse, human, and chicken do not 
display such orientation bias (Dewannieux et al. 2004; Barr et al. 
2005; Brady et al. 2009). The relative lack both of genomic ERV 
integrants within transcribed genes and of sense-oriented in- 
tragenic integrants presumably reflects strong ongoing purifying 
selection against them, due to their putative deleterious conse- 
quences on gene expression, structure, and function. Genes that 
lack such ERV integrants across all lineages are likely to be func- 
tionally essential in early development and viability of the or- 
ganism. Alternatively, current patterns of insertions could reflect 
de novo integration preferences, but this possibility is refuted by 
the finding that older ERV integrants occur at even lower densities 
in gene introns. Together, these results strongly suggest that ERV 
integrants can exert deleterious effects on gene expression and 
function, and therefore the remaining extant integrants are rela- 
tively absent from gene introns. 

An intronic ERV in SIcl5a2 disrupts transcription, protein 
expression, and function 

The identification of thousands of polymorphic, intronic ERVs in 
diverse mouse strains provided a unique opportunity to assess their 
roles in transcriptional and functional variation, by comparing 
gene expression and functions in strains with and without such 
individual integrants. We first screened a collection of genes con- 
taining intronic ERVs by using reverse transcriptase-mediated 
polymerase chain reaction (RT-PCR) to identify fusion transcripts 
initiated from upstream polymorphic ERV promoters (Wheelan 
et al. 2005). We identified spliced downstream fusion transcripts 
initiated from intronic ERVs in Slcl5a2 (i.e., solute carrier family 
15 [H + /peptide transporter], member 2) and Polrla (i.e., poly- 
merase [RNA] I polypeptide A) among others, expressed in a tissue- 
specific manner (data not shown). We designate such polymorphic 
integrants as ERV 5ZcI5a2 , ERV PoZrJa , etc. 



To assess whether additional Slcl5a2 
transcriptional variants are associated with 
the presence or absence of ERV SZcJ5a2 , we 
probed RNA blots using cDNA probes gen- 
erated from both 5' and 3' ends of conven- 
tional, full-length Slcl5a2 transcripts (Fig. 3). 
Contrary to our initial RT-PCR results, the 
ERV-Slcl5a2 downstream fusion transcript 
is not abundantly expressed, as shown by 
Northern blot using a 3' probe (Fig. 3 A). 
However, both 5' and 3' probes demon- 
strated that the presence of ERV SZci5a2 in in- 
tron 7 is strongly associated with premature 
transcriptional termination, resulting in up 
to 39-fold reductions in full-length 4-kb 
transcripts and concomitant increases of up 
to 13 -fold or more of prematurely truncated 
1.2-kb transcripts (Fig. 3 A). These very signifi- 
cant changes in transcript structures and levels 
were observed in several tissues including 
brain and kidney. Notably, the prematurely 
truncated transcripts are expressed strongly 
only in strains harboring ERV SZcI5a2 , con- 
firmed by quantitative reverse transcriptase- 
mediated PCR (qRT-PCR) and expression microarray assays (Sup- 
plemental Fig. 3). 

Slcl5a2 encodes PEPT2, a well-studied transporter of peptide- 
like molecules that is expressed in mouse brain, choroid plexus, 
kidney, lung, breast, and eye. PEPT2 has significant biological 
importance since it transports pharmacologic agents such as beta- 
lactam antibiotics in the kidney and brain (Brandsch et al. 2008), 
protects against 5 -aminolevulinic acid neurotoxicity (Hu et al. 
2007), and reduces the analgesic effect of L-kyotorphin Qiang et al. 
2009). While Slcl5a2 experimental knockout mice are viable and 
fertile, their physiological transport of certain oligopeptides is 
disrupted (Shen et al. 2003; Smith et al. 2011). 

The ERV-associated transcriptional disruption results in ap- 
proximately threefold to ninefold reductions in protein expression 
in each individual B6 mouse tested, when compared with DBA/2J 
individuals that lack ERV SZci5a2 (Fig. 3B). Resulting PEPT2 func- 
tional peptide transport activity is also significantly reduced in 
B6 brain and lung compared with DBA/2J tissues (Fig. 3C). Thus, 
significant functional variation between the mouse strains is 
strongly associated with the presence or absence of an intronic 
ERV integrant within this genetic locus. 

The truncated Slcl5a2 transcript also was detected in strains 
lacking ERV SZcJ5a2 , albeit at very low levels, upon prolonged ex- 
posure of Northern blots (Fig. 4A). This prematurely truncated 
transcript does not terminate inside ERV SZci5a2 itself and includes 
no sequences templated by that ERV, in contrast to transcripts 
terminated within other intragenic transposable elements (Zhou 
et al. 2007; Zhang et al. 2008; Li et al. 2010). Instead, 3' rapid 
cloning of cDNA ends (rapid amplification of cDNA ends, RACE) 
experiments using adult brain total RNAs demonstrated that 
truncated Slcl5a2 transcripts stop —1.5 kb upstream of ERV SZci5a2 , 
distal to the splice donor site at the 3' end of exon 7 (GenBank 
accession numbers JF495121-JF495122) (Fig. 4B). Their 3' ends 
precisely match transcripts previously identified in mammary tis- 
sue, tumor, and in day 16 embryos from B6 and FVB mouse strains 
(GenBank accession numbers NM_001 145899, BC018335, 
AK018393, and BC051199). Notably, two weak predicted poly- 
adenylation signal sequences (Beaudoing and Gautheret 2001) 
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Figure 3. An intronic ERV polymorphism disrupts Slcl5a2 expression and function. (A) Northern 
blots. Equivalent amounts (1 0 meg each) of total RNAs from brains pooled from several individuals from 
the indicated lineages were electrophoresed. Northern blots were probed with 5' (left) and 3' (right) 
probes from Slc15a2. (Left) Truncated transcripts (1.2 kb, arrow) correlate with the presence of 
a polymorphic ERV in B6, 129S1, and 129X1 strains but absent from the others. The full-length (non- 
terminated, 4 kb) Slcl5a2 transcript is expressed robustly in the absence of the ERV integrant in A/J and 
DBA mice. (Right) No appreciable downstream fusion transcript (2 kb) was detected, although it was 
identified by qRT-PCR (data not shown). Loading controls are shown in Supplemental Figure 3A. (B) 
Western blots. Protein extracts from individual brains (left) and lungs (right) from B6 and DBA mice 
were electrophoresced and probed for PEPT2 using protein-specific antiserum. (C) Functional assay in 
vivo. Accumulation of radiolabeled Gly-Sar dipeptide substrate was measured in choroid plexus and 
lung from B6 versus DBA mouse lineages, indicating significantly different PEPT2 functional activities 
(asterisks). 



we could assess relationships between 
SNPs genome-wide and variable Slcl5a2 
transcription by considering both trun- 
cated and full-length transcripts as eQTLs. 
The results demonstrate a very strong as- 
sociation between the ERV s/ci5a 2-positive 
haplotype (as approximated by the clos- 
est informative SNP, rs41 73858) and 
differential Slcl5a2 expression, i.e., both 
truncated and full-length transcripts 
(Fig. 5). Almost all BxD RI lines that are 
ERV 5 / CJ5a2 -positive express significantly 
more truncated Slcl5a2 transcript and 
significantly less full-length transcripts 
(Fig. 5B, bottom, cf. probe sets 1, 2, and 3). 
A few discrepant BxD lineages have SNP 
genotypes that appear to contradict the 
Slcl5a2 expression levels. These appar- 
ent discrepancies each were resolved by 
checking the absence/presence status of 
ERV 5 / Cj5a 2 (Supplemental Fig. 4; data not 
shown), rather than the adjacent SNP 
surrogate. Thus the ERV genotypes are 
all strongly correlated with the expres- 
sion levels measured in each BxD RI 
strain. 

We resequenced 1 kb upstream of and 
downstream from the premature termina- 
tion site in multiple mouse strains (data 
not shown), disclosing only a single, pre- 
viously identified, nonsynonymous SNP 
within exon 6 that does not correlate ei- 
ther with differential Slcl5a2 expression 
or with the polymorphic ERV S / Cj5a2 in- 
tegrant. Moreover, we compared 335 kb 
of adjacent genomic sequences in B6 ver- 
sus DBA/2J wild-type genomes, thereby 
identifying 42 SNPs and seven small indel 
polymorphisms. None of these variants, 
other than ERV S / Ci5fl 2 itself, are located 
inside of known coding genes; they each 
are upstream, downstream, or within gene 
introns, or within noncoding genes. 
None are classified as deleterious. Thus 
we conclude that ERV 5Zci5a2 itself is the 
genetic determinant of variable tran- 
scription of Slcl5a2 in cis. 



occur in intron 7, i.e., 5'-GATAAA and 5'-ATTAAA, immediately 
downstream from exon 7 and and upstream of the premature, 
added poly (A) tails (Fig. 4B). 

To strengthen the genetic association between ERV 5 / Ci5a2 
and transcriptional disruption further, we analyzed Slcl5a2 ex- 
pression data collected in kidneys from 53 distinct recombinant 
inbred (RI) mouse strains. These mice were derived from inter- 
crosses of B6 and DBA/2J lines (B6 X DBA intercrossed mouse, 
BxD), resulting in a panel of highly recombinant mice with homo- 
zygosity at virtually every genetic locus, facilitating the identifica- 
tion of the genetic determinants of expression quantitative trait 
loci (eQTLs) (Chesler et al. 2005). Since B6 and DBA wild- type 
mice do and do not contain the ERV SZci5fl 2 integrant, respectively, 



Effects of the heterozygous ERV's parent of origin 

To assess possible consequences of ERV 5fci5a2 heterozygosity on 
Slcl5a2 expression, we reciprocally crossed homozygous strains 
with (B6) and without (CAST/EiJ, CAST) this ERV, respectively. 
These intercrosses resulted in F x offspring with ERV 5Zci5fl 2 inte- 
grants having either parent of origin. Nonterminated (i.e., pre- 
sumably full-length) Slcl5a2 transcripts are significantly reduced 
in heterozygous CAST X B6 F x offspring with paternally derived 
ERVs/ci5a2 (Fig- 6A,B). This reduction is comparable to that ob- 
served in homozygous B6 mice, rather than an intermediate level 
of expression as predicted from heterozygosity of the ERV. Thus, 
terminated Slcl5a2 transcript expression is strongly associated 
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Figure 4. Transcriptional termination occurs at pre-existing signal upstream of ERV. (A) Prematurely terminated transcripts are present at low levels 
(arrows) in kidneys of mouse strains lacking the polymorphic ERV, indicating that the premature transcriptional polyadenylation signal exists both in strains 
that have or lack the ERV, and is not templated by the ERV per se. (B, top) Schematic of the Slc15a2 locus showing site of premature transcriptional 
termination ~1 .5 kb upstream of the intronic ERV polymorphism present in the B6 strain, in intron 7. (Bottom) Sequence trace from 3'-RACE experiment, 
demonstrating that the 3' end of the prematurely truncated transcript is polyadenylated ~1 .5 kb upstream of the ERV and contains no ERV-templated 
sequences perse. (Red arrows) Weak pre-existing polyadenylation signals (i.e., 5'-GATAAAand ATTAAA) are present in the intron, immediately upstream 
of the added poly(A) tail. GenBank accession numbers JF4951 21-JF4951 22. 



with the introduced intronic ERV; one (ERV + ) allele can affect ex- 
pression from the other (ERV~). In contrast, F x offspring with the 
maternally derived ERV sfci5a2 allele exhibit robust expression of 
nonterminated Slcl5a2 transcripts (Fig. 6A,B). In both crosses, we 
observed expression of both alleles at approximately equivalent 
levels (Supplemental Figs. 3 and 5). Thus the parent of origin of the 
ERV S / Ci5fl 2 polymorphism affects the expression levels of non- 
terminated Slcl5a2 transcripts in the offspring, and transcriptional 
disruption can occur between alleles. 

In contrast to differential expression of full-length transcripts, 
the prematurely truncated 1.2-kb transcript is detected at approx- 



imately equivalent, high levels in all mice that contain the ERV, 
much more than in strains lacking it (Fig. 6A,C). Notably, in some 
cases, the reduced expression of full-length transcripts is not cor- 
related inversely with increased expression of prematurely trun- 
cated transcripts. 

We sought to compare Slcl5a2 expression levels in individual, 
age-matched mice with the same ERV 5 / Ci5a2 genotypes but derived 
from different genetic backgrounds. Thus we set up additional 
genetic crosses of wild-type and F x mice on both B6 and CAST 
genetic backgrounds, resulting in individual F x and F 2 offspring 
with all possible homozygous or heterozygous ERV 5/Cj5a 2 geno- 
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Figure 5. Strong genetic associations between transcriptional disruption and ERV S / C 7 5o 2 status in c/'s. (A) eQTL permutation analysis indicates a very 
strong association between a SNP (rs41 73858) genotype, which serves as a surrogate for ERV S / c75a 2 ~ 137 kb distant, and expression of the Slc15a2 
truncated transcript in mouse recombinant inbred BxD strain kidneys. (Red line) The chromosomal position of Slc15a2; (/-axis) P-values were calculated 
for the association between each SNP at the indicated chromosomal coordinates and truncated Slcl5a2 transcript levels. (B, top) Schematic of Affymetrix 
microarray probe sets detecting (1, 2) truncated or (3) full-length transcripts. (Bottom) Individual expression data (x-axis, log scale) measured by 
microarray probe sets (1-3) for each recombinant inbred BxD strain with indicated SNP genotypes: (red) B6; (blue) DBA; (black) heterozygous or 
indeterminate. (C) Box plots showing log of transcript expression versus genotypes: (B) B6; (D) DBA/2J. Error bars indicate SD. P-values for expression 
differences between genotypes B and D were calculated using a t-test: probe 1 = 1 .80 x 10~ 22 ; probe 2 = 5.53 x 10~ 23 , and probe 3 = 4.58 x 10" 10 . 
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Figure 6. Transcriptional termination occurs between alleles in Ft and F 2 mice. (A, top) Northern blot 
demonstrating differential reduction in full-length transcripts in brains from CAST x B6 but not B6 x 
CAST Ft hybrid with heterozygous ERV integrants. In contrast, truncated transcripts (arrows) are 
detected in both lineages. (Bottom) Loading control showing 28S and 1 8S rRNA. Comparable amounts 
(1 0 meg) of total RNA were loaded in each lane. (B) Quantitative RT-PCR assay for full-length transcripts 
(extending past exon 7) in brains from various mouse strains. Results are expressed as the fold change in 
levels relative to the sample with the lowest concentration. (C) Quantitative RT-PCR assay for the 3' end 
of prematurely truncated transcripts shows that their expression is boosted specifically in strains con- 
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(each in duplicate or triplicate) in individual mice with indicated genotypes. Results were normalized to 
Hprt (i.e., hypoxanthine guanine phosphoribosyl transferase) transcript expression. (Error bars) Range of 
data. Numbers at top are identifiers for individual mice (Supplemental Table 3). 



types where the allelic parents of origin are known unambiguously. 
We quantified both nonterminated and truncated Slcl5a2 tran- 
scripts in individual whole-brain extracts using qRT-PCR (Fig. 6D; 
Supplemental Fig. 5; Supplemental Table 4). Consistent with the 
results presented above (Fig. 3; Supplemental Fig. 3), nonterminated 
transcripts are significantly reduced in the presence of ERV S/Cj5a 2, 
up to ~ 16-fold, compared with its absence. Nonterminated tran- 
script levels also are significantly lower in individuals with the 
paternally derived ERV 5/Ci5a2 , compared with its maternal in- 
heritance. Prematurely truncated transcripts are expressed ro- 
bustly whenever the ERV is present and are increased further 
when ERV S / Ci5a2 is present in homozygosity, i.e., up to ~49-fold 



(Fig. 6D). The results also demonstrate 
relatively modest variability between 
individuals with the same ERV geno- 
type at Slcl5a2, regardless of their diverse 
ancestries. Thus, neither ancestral expo- 
sures to the ERV nor unlinked, distant 
genetic modifiers alter these ERV-medi- 
ated effects substantially. 

Since the nonterminated transcript 
levels reflect the parent of origin of the 
intronic ERV, we were prompted to assess 
DNA methylation at ERV 5Zci5fl 2 in various 
heterozygous and homozygous mice. We 
observed differential methylation that is 
associated with the ERVs parent of origin 
(Fig. 7). Its 5' long terminal repeat (LTR), 
closer to upstream Slcl5a2 exon 7, is rel- 
atively hypomethylated in B6 X CAST Fl 
mice, with only —50% CpGs methylated. 
The 5 ' LTR is more densely methylated in 
B6 (74%) and particularly in CAST X B6 
F x (91%) mice. In contrast, the 3' LTR is 
densely methylated (95%- 100%) in all 
lineages tested (Fig. 7). Increased meth- 
ylation at the 5' LTR is associated with 
decreased levels of the nontruncated 
transcripts. 

Disruption of other genes by 
polymorphic, intronic ERVs 

We asked whether intronic ERVs in other, 
independent genes could disrupt their 
expression similarly. Using RT-PCR, we 
identified prematurely truncated tran- 
scripts at Polrla and Sponl (i.e., spondin 1). 
The differential expression of truncated 
transcripts again correlates precisely with 
the presence or absence of intronic ERVs 
acting at a distance (Fig. 8; Supplemental 
Fig. 6; truncated Polrla transcript, Gen- 
Bank accession number AK087773.1). 
The polymorphic ERVs are oriented ei- 
ther parallel or antiparallel, respectively, 
relative to the genes' reading frames, in- 
dicating that transcriptional termination 
can be triggered independent of the ERVs 
orientation. While downstream fusion 
transcripts are robustly expressed in the 
case of ERV Po i rla (data not shown), such 
expression is not necessary for premature truncation of the over- 
lapping gene (as demonstrated at Slcl5a2) (Fig. 3). 

Polrla nonterminated transcripts are significantly reduced 
with paternally derived ERV PoZrJa when compared with mater- 
nally derived ERV PoZrifl (Fig. 8D). This is similar to the associa- 
tion between reduced expression of nonterminated (full-length) 
Slcl5a2 and the paternally derived ERV 5/ci5fl 2. Moreover, trun- 
cated transcripts are expressed at approximately equivalent 
levels from both alleles in offspring from both reciprocal crosses 
(Supplemental Fig. 6). Nonterminated transcripts of Sponl dis- 
play biallelic expression, regardless of the parent of origin of 
ERV Sp onl • 



ERV-/- 
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Figure 7. Differential methylation at ERV s/c7 5a2 reflects its parent of origin. DNA methylation at left (L) 5' and right (R) 3' LTRs of ERV S / C 7 5a 2 was assessed 
using bisulfite sequencing of genomic DNA purified from brains of indicated mouse lineages. (Top, schematic) In amplicon L, primers DES2652 and 
DES4883 yielded a 272-nt genomic DNA fragment to assess the methylation status of 1 1 CpG dinucleotides (circles) presented for multiple cloned alleles 
(horizontal lines). In amplicon R, primers DES4881 and DES2649 yielded a 304-nt fragment to assess eight CpGs. (Filled circles) Methylated cytosine; 
(open) unmethylated. (Upper left corner of each panel) Percentages of cytosines that are methylated. 



We surveyed the mouse transcriptome for additional candi- 
date genes whose expression may be disrupted by intronic ERVs 
<10 kb away from upstream exons. In addition to reidentifying 
premature transcriptional termination at Slcl5a2, Polrla, and 
CdkSrapl (i.e., Cabp, CDK5 regulatory subunit associated protein 1) 
(Druker et al. 2004), this bioinformatics screen identified more 
than 100 independent genes including non-RefSeq transcripts 
(Table 2; Supplemental Table 5). Adding the prematurely truncated 
transcript at Sponl, where full-length transcription is disrupted by 
an intronic ERV at a genomic distance exceeding 12.5 kb (Fig. 8), 
we anticipate that many more prematurely truncated transcripts 
will be associated with adjacent ERVs in future studies of distinct 
mouse tissues, developmental stages, and nonreference mouse 
strains. 

Discussion 

By developing the transposon junction assay with targeted deep 
sequencing, we mapped thousands of young ERVs in highly di- 
vergent mouse strains. The resulting catalog of ERV polymorphisms 
facilitated the identification of particular transcripts whose dif- 
ferential expression in the highly divergent mouse lineages could 
be attributed to them. Integrants that are identical (i.e., present at 
orthologous loci) across such widely divergent lineages represent 
ancestral retrotransposition events that are identical by descent 
(Salem et al. 2005; Ray et al. 2011), while the youngest integrants 
are likely to be lineage-specific and are highly polymorphic (Fig. 1; 



Zhang et al. 2008; Qin et al. 2010). The ERV polymorphisms occur 
mostly in extragenic chromosomal regions and are at particularly 
low densities within embryogenesis genes and genes that are 
highly expressed in ES cells (Fig. 2). When present, they are oriented 
mostly antiparallel to the genes' reading frames, extending pre- 
vious studies to previously unsequenced, highly divergent line- 
ages (Smit 1999; Medstrand et al. 2002; van de Lagemaat et al. 
2006; Zhang et al. 2008) and suggesting that they can disrupt gene 
expression and function. This distribution of polymorphic ERVs 
also suggests that remaining intronic integrants have survived 
purifying selection in the diverse mouse lineages over evolutionary 
time, because older elements are particularly depleted from in- 
tragenic sites. 

We characterized several genes whose usual expression is 
disrupted profoundly by polymorphic ERV integrants acting at 
a distance, both in cis and between alleles. We conclude that the 
ERVs themselves are the genetic determinants of transcriptional 
disruption occurring at a distance in cis, because of several obser- 
vations: (1) There is a strong and consistent association, across 
many mouse strains, between the presence of high levels of pre- 
maturely truncated transcripts and the presence of downstream, 
polymorphic, intronic ERVs. (2) Multiple independent genes at 
several different chromosomal positions exhibit similar effects. (3) 
Expression quantitative trait locus (eQTL) analysis in BxD re- 
combinant strains established very strong genetic associations 
between the ERV-containing genotype and disrupted transcript 
isoforms (Fig. 5). (4) There are no other polymorphic genomic 
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Figure 8. Disruption of additional genes by polymorphic, intronic ERVs in either orientation. (A) Genome structure of Polrla containing a polymorphic 
AS ERV in intron 20, present in A/J and B6 and absent from DBA/2J and CAST mice. Various PCR primers are shown; (ex) exon number; (S) sense; (A) 
antisense. (Red arrows and brackets) cDNAamplicons; (U) upstream; (N) nonterminated, i.e., full-length. (B) Premature Polrla termination occurs in brain 
and testis of A/J but not DBA mice. RT-PCR assays measured expression of upstream (U) versus nonterminated (N) transcripts, using ex16S and ex19A 
versus ex16S and ex22A primers, respectively. (Arrows) Differentially expressed, nonterminated transcripts. (Right) Loading control for spliced Hprt 
transcript assayed by RT-PCR. (C) Quantitative RT-PCR assay measuring relative differences between upstream and downstream Polrla transcript levels, 
i.e., prematurely terminated transcripts. (Error bars) Range of duplicates. (D) Parent-of-origin effect on nonterminated Polrla transcript levels in het- 
erozygous mice. RT-PCR assays measured expression of upstream (U) vs. nonterminated (N) transcripts, using exl 6S and exl 9A versus exl 6S and ex21 A 
primers, respectively. (Arrows) Differentially expressed, nonterminated transcripts. See Supplemental Figure 6. (£) Genomic structure of Sponl containing 
a polymorphic ERV in intron 6, present in A/J but not DBA mice. (F) Premature Sponl termination in brain and testis of A/J but not DBA mice. (Arrows) 
Differentially expressed, upstream (U) and nonterminated (N) transcripts shown by RT-PCR assays. Both upstream and particularly full-length Sponl 
transcripts are reduced in A/J mice (based on similar input RNA levels vs. DBA mice). (C) Quantitative RT-PCR assay measuring relative differences between 
upstream and downstream Sponl transcript levels, i.e., prematurely terminated transcripts. (Error bars) Range of duplicates. 
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Table 2. Identification of candidate transcripts truncated 
prematurely by ERV integrants 



Gene name 



120001 4 jl IRik 


Dtdl 


Phf8 


201 001 5L04Rik 


Dym 


Pkdll2 


201 0111 101 Rik 


E130309F12Rik 


Polrla 


251 0009E07Rik 


Enpp6 


Pot eg 


3300002l08Rik 


Epb4.1l2 


Prrxll 


4930430F08Rik 


Epsl5 


Qrsll 


6720457D02Rik 


Exoc6 


Rab3gap 1 


Abcg3 


Fancd2 


Ralgapal 


Acly 


Galk2 


Rhbdl2 


Adamts3 


GalntlO 


Rnfl57 


Agbl4 


Gimap5 


Sag 


Akrlcl4 


Gm4979 


Sema3d 


Angptl 


Golga3 


Sgipl 


Arfgef2 


Gpsml 


Sirt5 


Asb3 


Iars2 


Slcl5a2 


Atp6vlh 


\f\44 


S Id 7 a 5 


Bmx 


Iqca 


Slc20a2 


Cede 15 


Irak3 


Slc25a46 


Ccdc46 


Itgb3bp 


Slc38al 


Cdk5rapl 


Katnall 


Slco6bl 


Cenpq 


Klhll3 


Snx29 


Chll 


Lama3 


Sypl2 


Cmah 


Letml 


Tbcld22a 


Cog6 


Mapk4 


Tcfcp2 


Col4a4 


Me3 


Tmco3 


Ctps2 


Me3 


Timed 7 


Cyp20al 


Ml It 10 


Trpm2 


Dcplb 


Myoml 


Txndcl 1 


Dctn4 


Ophnl 


Uty 


Diap3 


0rai2 


Vps52 


Dnahcl 


Otoa 


Whscl 


Dnahc7b 


Paqr3 


Zfand3 


Dph5 


Parp4 


Zfp407 


Dsg2 


Phc3 





Listed here are candidate genes that may be disrupted by ERV (IAP) 
integrants found within 1 0 kb genomic distance downstream from pre- 
mature termination sites. This analysis focused on reference B6 genes 
because currently available mouse transcriptome data are limited mostly 
to this strain. Slcl5a2 (Figs. 3-7), Polrla (Fig. 8), and Cdk5rapl (Druker 
et al. 2004) were identified in this screen. Sponl is not listed here, al- 
though its transcription is disrupted by an intronic ERV (Fig. 8), because we 
limited this screen to identify only candidate ERVs <1 0 kb from the pre- 
mature termination site. 



features in cis that plausibly or consistently explain the observed 
expression differences. 

The prematurely terminated transcripts identified here read 
past canonical splice donor sites and appear to use pre-existing 
intronic polyadenylation signal sequences that otherwise are not 
used routinely (Fig. 4B). We hypothesize that the polymorphic 
ERVs dramatically alter the use of these splicing and termination 
signals, which are nonpolymorphic regardless of the integrants' 
presence or absence (Figs. 3B, 4A). Such alternative transcriptional 
processing can occur coordinately (Wang et al. 2008), as exempli- 
fied by lamins expressed from Lmnb2 (Furukawa and Hotta 1993) 
and LMNA (Lin and Worman 1993). Resulting transcripts con- 
taining intronic sequences at their 3' ends have been termed 
"composite exons" (Yan and Marr 2005). 

Similar transcriptional termination occurring at a distance 
has been attributed to an intronic ERV in Cdk5rapl (Druker et al. 
2004). To our knowledge, the prematurely truncated Cdk5rapl 
transcripts are the only previously reported case of transcriptional 
disruption occurring upstream of such an ERV integrant. However, 
significant variability in levels of both downstream and pre- 



maturely terminated upstream Cdk5rapl transcripts was de- 
scribed between individual animals. Thus, the Cdk5rapl locus 
was described as a "metastable epiallele," comparable to highly 
variable expression of A vy and Axinl Fu (Morgan et al. 1999; 
Whitelaw and Martin 2001). In contrast, we did not observe such 
a high degree of inter-individual variability at Slcl5a2 (Figs. 3, 6; 
Supplemental Fig. 5). 

Nonterminated (i.e., full-length) transcript levels at Slcl5a2 
and Polrla reflect the parent of origin of heterozygous ERVs (Figs. 
6, 8; Supplemental Figs. 3, 5, 6). When the intronic, heterozygous 
ERV was derived from the father, expression of nonterminated 
transcripts is reduced significantly from both alleles. In contrast, 
when the heterozygous ERV was maternally derived, full-length 
transcripts are expressed robustly, i.e., at levels similar to those in 
the ERVs absence. Regardless of the ERVs parent of origin and 
their overall expression, the nonterminated transcripts are ex- 
pressed from both allelic templates at approximately equivalent 
levels (Supplemental Figs. 3, 6). Previously, transposons have been 
implicated as targets for establishment of imprinting and differ- 
entially methylated regions at particular loci (Suzuki et al. 2007), 
although a detailed molecular mechanism was not described. We 
found that the 5' LTR of heterozygous ERV S/Cj5a2 is differentially 
methylated, depending on its parent of origin (Fig. 7). When in- 
herited from the father, the 5' LTR is densely methylated, whereas 
when it is maternally inherited its methylation is reduced. This 
differential epigenetic control appears to be associated with dif- 
ferential levels of nonterminated transcripts. Such silencing epi- 
genetic marks may mediate this parent-of-origin effect possibly by 
affecting transcriptional processivity past the ERV (Rebollo et al. 
2011), although this does not directly explain the effects between 
alleles that we observed. The ERVs 3' LTR is consistently methyl- 
ated, regardless of its parent of origin (Fig. 7). 

We did not observe a parent-of-origin effect in expression 
levels of prematurely truncated transcripts. Their expression ap- 
pears to be boosted whenever the ERV is present. Moreover, the 
expression of full-length and truncated transcripts is not always 
inversely correlated; they do not sum up to a constant level of 
upstream initiation and transcription (Fig. 6D; Supplemental Fig. 5). 
This implies that transcriptional initiation, prolongation, splicing, 
and premature termination may not be coordinately regulated in 
some cases, and instead may undergo independent, complex pat- 
terns of regulation. 

Several other distinct cases of transcriptional regulation and 
disruption illustrate multiple potential effects by ERVs on gene 
expression. An inverse association was reported between differ- 
ential DNA methylation of the Cdk5rapl intronic ERVs 5 ' LTR and 
premature termination of Cdk5rapl transcripts (Druker et al. 2004). 
Additionally, nonterminated Cdk5rapl transcripts are consistently 
expressed, independent of the terminated transcripts' variable 
expressivity. In axin fused mice, variable expression of downstream 
Axin transcripts has been associated with differential methylation 
of the 5' LTR of the intronic ERV in the Ax Fu allele, and either 
parent can transmit the epigenetic state (Rakyan et al. 2003). In 
contrast, in the A vy mouse, variable expressivity of nonagouti is 
related to differential methylation of the 5' LTR of an upstream 
ERV, and the epigenetic state is maternally inherited (Morgan et al. 
1999). Transcripts from the imprinted mouse gene HI 3 recently 
were found to undergo alternative polyadenylation that is regu- 
lated epigenetically by differential methylation of an internal 
promoter, albeit not in an ERV (Wood et al. 2008). Thus these 
various distinct expression patterns contrast with the transcrip- 
tional disruption described here. 
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Recently, bioinformatics analysis of the human transcriptome 
correlated antisense (AS) transcription with alternative splicing of 
overlapping sense-strand transcripts (Morrissy et al. 2011). The 
aberrantly spliced and terminated transcripts at Cdk5rapl (Druker 
et al. 2004) were attributed tentatively to possible AS transcription 
initiated from the intronic ERV promoter. However, evidence for 
AS transcription was not demonstrated, and the underlying mo- 
lecular mechanism remains unknown. Such a model for transcrip- 
tional interference, i.e., collisions of bidirectional RNA polymerase 
complexes (Eszterhas et al. 2002), plausibly could explain CdkSrapl 
disruption (Druker et al. 2004). However, transcriptional inter- 
ference would not explain transcript expression differences be- 
tween alleles as described here (Figs. 6, 8; Supplemental Figs. 3, 5, 
and 6). Alternatively, diffusible AS transcripts could act at long 
genomic distances, both in cis and between alleles. AS transcripts 
could affect host gene splicing by blocking Ul snRNA base-pairing 
with pre-mRNAs (Kaida et al. 2010). ERV-mediated alterations in 
gene "punctuation," where the polymorphic ERV could alter gene 
looping or interactions between homologous alleles by disrupt- 
ing long-range interactions between upstream gene promoters 
and various downstream terminator sites (Tan-Wong et al. 2008), 
could provide another explanation for transcriptional disrup- 
tion. Intragenic ERV integrants could introduce targets for het- 
erochromatin formation that could disrupt full-length transcription 
in cis. Other possibilities also are plausible (Wilusz and Spector 
2010). 

We identified about 100 intronic ERV candidates that may 
trigger premature transcriptional termination at a distance (Table 2), 
out of approximately 1025 genes displaying evidence for pre- 
mature termination. We speculate that other types of ERVs (i.e., 
ETn/ MusD elements) (Zhang et al. 2008), retrotransposons, or re- 
petitive elements similarly could trigger transcriptional truncation 
in at least some of the remaining —90% of genes lacking such 
intronic ERV candidates. We are addressing this interesting question 
currently. 

Extrapolating from the number of intronic ERV poly- 
morphisms identified in diverse mouse lineages, we estimate that 
up to —10% of all genes containing intronic ERVs exhibit tran- 
scriptional disruption mediated by the integrants acting at a dis- 
tance. This calculation may underestimate the full extent of ERV- 
mediated transcriptional disruption, since comprehensive trans- 
cipt expression data are lacking from various tissues and de- 
velopmental time points from the divergent strains studied here. 
On the other hand, most of these candidates have not been vali- 
dated by molecular assays. We did not detect a significant differ- 
ence in the relative orientation of intronic ERVs that appear to 
trigger premature truncation (i.e., —30% AS compared with over- 
lapping genes), when compared with all intronic ERVs (i.e., —23% 
AS, p = 0.102). We postulate that thousands of other intronic ERVs 
present in different lineages (Table 1) are unlikely to disrupt over- 
lapping gene expression and function in this way, because such 
genes presumably lack the pre-existing, weak polyadenylation or 
alternative splicing signals that could be boosted by them. De novo 
intronic ERV integrants that strongly affect gene transcription, 
particularly of essential genes, would be expected to be highly del- 
eterious, explaining their relative exclusion from embryogenesis 
genes and genes highly expressed in ES cells (Fig. 2). This conclusion 
is consistent with the demonstration that fusion transcripts are 
initiated from ERV LTR promoters in oocytes and in early embryo- 
genesis (Peaston et al. 2004). 

Our results strongly suggest that genome-wide studies based 
solely on SNP genotyping may miss important determinants of 



transcriptional variation and functional diversity. Comprehensive 
knowledge of all forms of structural variation within and between 
individuals, including indel polymorphisms caused by actively 
mobilized repetitive elements such as ERVs, will be critically im- 
portant to understand the molecular basis for phenotypic varia- 
tion (Li et al. 2010; Keane et al. 2011; Yalcin et al. 2011). Although 
ERVs appear to be inactive in humans, —10% of the genome is 
comprised of such elements, suggesting that similar transcrip- 
tional disruption could be mediated by their promoter activities. 
While too numerous and diverse to describe here, other trans- 
poson families and retroposed elements continue to be actively 
mobilized in both the mouse and human genomes, thereby also 
introducing promoter activities, new polyadenylation signal se- 
quences in cis (Li et al. 2010), new splicing sites, and targets for 
epigenetic regulation (Macfarlan et al. 2011; Monk et al. 2011). 
Further characterization of the molecular causes and conse- 
quences of transcriptional variation caused by genomic ERVs 
and other families of transposons and, in particular, the detailed 
mechanisms for premature transcriptional polyadenylation trig- 
gered at a distance undoubtedly will be promising areas for further 
study. 

Methods 

Mouse colony and genomic DNA 

Mice were maintained and euthanized according to approved In- 
stitutional Animal Care and Use Committee protocols (National 
Cancer Institute, Frederick, MD; and Ohio State University, Co- 
lumbus, OH). Mouse strains and purified genomic DNA were 
purchased from the Jackson Laboratory (Bar Harbor, ME). 

Bioinformatics tools and statistical analysis 

Alignments of pyrosequencing reads to the B6 reference genome 
assembly were performed using GMAP (Wu and Watanabe 2005; 
Akagi et al. 2008), BLAT, and BLAST. Results were parsed using 
custom Perl scripts with BioPerl modules. Statistical analyses were 
performed using SPSS (http://www.spss.com) or R software as de- 
scribed. Analysis of mouse ES cell expression data was based on 
public data sets in the GEO repository under accession number 
GSE8024 (Mikkelsen et al. 2007). Genes annotated as embryo- 
genesis genes were identified from the Mouse Genome Informatics 
database (http://www.informatics.jax.org). 

Further details about our assessment of possible bias in 
detecting ERVs using the transposon junction assay and our pro- 
cedures for identification of ERVs in four "Celera strains" are pro- 
vided in the Supplemental Material. To assess their chromosomal 
distributions (Supplemental Fig. 2), we counted retrotransposons 
in 500-kb bins genome-wide. Reference distributions of retro- 
transposons for each class (LI, ERV, and SINE) were obtained from 
the UCSC mm8 mouse reference assembly. Similarly, polymorphic 
retrotransposon distributions were determined by counting both 
unique insertions in reference and insertions in alternative strains 
we identified from four strains by Celera shotgun sequencing 
(Mural et al. 2002). 

To identify candidate genes with prematurely truncated 
transcripts (Table 2), we compared chromosomal coordinates of 
—20,180 RefSeq reference genes (UCSC Genome Browser) with 
those from the Known Gene track in the UCSC database. We 
compared annotated gene symbols in cases in which the genomic 
template for a Known Gene transcript is >20 kb shorter than that 
for a corresponding RefSeq gene transcript. In approximately 
1025 cases in which the assigned NCBI gene ID numbers are 
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identical, we called such a Known Gene transcript a truncated 
variant of the full-length RefSeq gene. In such cases, we scanned 
within 10 kb downstream from the 3' end of the truncated 
transcript for ERV elements (i.e., all IAP subtypes as defined by 
RepeatMasker). 



Identification of ERVs using transposon junction assay 

To map previously unsequenced ERV integrants in divergent 
mouse lineages, we developed a new high-throughput assay using 
nested PCR (Pornthanakasem and Mutirangura 2004) to amplify 
genomic sequences containing 3' junctions of transposon inte- 
grants, followed by deep 454 sequencing (Supplemental Fig. 1). 
Forward PCR primers were designed to anneal within young, 
highly conserved ERV integrant sequences; details are provided in 
the Supplemental Material. Resulting sequence traces were aligned 
to the mm8 mouse reference genome and analyzed for indel 
polymorphism status, using modifications of our sequence align- 
ment pipeline described in the Supplemental Material (Akagi et al. 
2008). To identify previously unsequenced ERV integrants, se- 
quencing reads from these PCR amplicons were mapped to the 
reference mouse genome assembly in three steps: preprocessing of 
reads, mapping of reads, and clustering of overlapping reads de- 
fining discrete insertion sites. We used this assay to identify pre- 
viously unsequenced ERV elements in six diverse mouse lineages, 
i.e., A/J, B6, CAST, MOLF, SPRET, and WSB. 



RNA isolation 

To preserve high-quality total RNAs for downstream transcriptome 
analysis, we collected tissues from both sexes of inbred mouse 
strains at day 72. RNAs were collected from strains B6, 129S1, 
129X1, DBA/2J, A/J, CAST, SPRET, MOLF, WSB, and intercrossed 
Fi hybrid offspring B6 X CAST, and CAST X B6, respectively. 
Trimmed tissues were immediately immersed in RNA Later 
(Ambion) and either snap-frozen at -80°C or transferred to 
TRIzol (Invitrogen), homogenized, and frozen. The quality of 
total RNAs was determined using a model 2100 Agilent bio- 
analyzer where >95% of the samples had RIN scores >9. RNA 
specimens isolated from the same strain, tissue, gender, etc. were 
pooled from at least five individuals, unless noted. 



Northern blots and RT-PCR assays 

Total RNAs from indicated mouse tissues were electrophoresed in 
agarose gels under standard conditions, transferred to charged 
nylon membranes (GE Amersham), and hybridized with radio- 
labeled DNA probes at the 5' and 3' ends, respectively, of Slcl5a2 
transcripts. Membranes were washed and exposed to film for au- 
toradiography. To synthesize first-strand cDNAs for reverse tran- 
scriptase-mediated polymerase chain reaction (RT-PCR) assays, 10 
juLg each of mouse total RNAs was primed for reverse transcription, 
using T7-anchored oligo(dT) 2 4 and Superscript II Reverse Tran- 
scriptase (Invitrogen). Gene-specific primers for Slcl5a2, Polrla, 
and Sponl were used to amplify resulting first-strand cDNAs. 
Products were assessed by agarose gel electrophoresis. Quantitative 
RT-PCR was performed using these cDNAs and Power SYBR Green 
PCR master mix (ABI) on a StepOnePlus instrument (ABI). To 
quantify relative expression of Polrla truncated transcripts (Fig. 8), 
we measured upstream and downstream transcript levels, calcu- 
lated the difference in PCR cycle numbers AAC T = (exl4S-exl5A) — 
(ex27S-ex28A), and then calculated linear differences as 2 AACt . 
Sponl premature truncation was measured similarly. Further de- 
tails are in the Supplemental Material. 



RACE 

5'- and 3 '-RACE analyses were performed using the 5/3 RACE Kit, 
second generation (Roche Applied Science), the FirstChoice RLM- 
RACE kit (Ambion), and primers for Slcl5a2 (DES2622, 5'-CTTC 
TGACAAGCACTCTGGAG-3 ') and Polrla (DES4410, 5'-TGGTCT 
CACCCTTTCTGTAACG-3 ') according to the kit manufacturers' 
protocols. 

Western blots and PEPT2 functional assay 

PEPT2 protein expression in tissues from individual mice was 
assayed by Western blots as described in the Supplemental Mate- 
rial. To assay PEPT2 protein functional activity, six B6 mice (three 
females, three males) and five DBA/2J mice (three females, two 
males) were injected via tail vein injection with 100 \xL of GlySar 
solution containing 5 |xCi of 14 C-GlySar (98 mCi/mmol, O.lmCi/mL; 
Moravek). Tissue concentrations of GlySar (nanomoles per gram 
of wet tissue) were calculated as described in the Supplemental 
Material (Ocheltree et al. 2005; Shen et al. 2007). 

Expression quantitative trait locus analysis 

Slcl5a2 expression data from kidneys of 53 BxD RI mouse strains 
were obtained from the Gene Network (http://www.genenetwork.org). 
Transcript levels were measured using the Affymetrix M430v2 
platform (database access code MA_M2_0806_R), which includes 
three 5/c25a2-specific probe sets. Two of these probe sets detect 
the 3' end of truncated transcripts (1424730_a_at, 1447808_s_at), 
and the other detects the 3' end of the full-length transcript 
(14171600_at) (Fig. 5B). To assess local strain genotypes B (B6) and 
D (DBA), 72 informative SNPs within 10 Mb on either side of 
Slcl5a2 were identified from the Gene Network. Expression levels 
for each genotype B and D were determined, and P-values were 
calculated using a t-test with multiple test correction according to 
the Holm method. For all three Slcl5a2 probe sets, the maximal 
-log(P-value) occurred at the SNP rs41 73858, the closest in- 
formative SNP to Slcl5a2 in as. Additional genomic sequence 
variants including SNPs and small indel polymorphisms flanking 
Slcl5a2 were identified in B6 and DBA/2J strains using Sanger In- 
stitute mouse genome sequencing data (http://www.sanger.ac.uk/ 
resources/mouse/genomes/; SNP 20110125 release REL1101 and 
indel20100713 release REL1007). 

Data access 

Sequences as indicated were assigned Genbank accession numbers 
JF495121-JF495122. All ERVs identified here are accessible via our 
MouselndelDB website at http://variation.osu.edu/ (Akagi et al. 
2010). 
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